Abstract

Page segmentation and zone classification are key areas of research in document image
processing, because they occupy an intermediate position between document preprocessing and
higher-level document understanding such as logical page analysis and OCR. Such analysis of
the page relies heavily on an appropriate document model and results in a representation of the
physical structure of the document. The purpose of this review is to analyze progress made in
page segmentation and zone classification and suggest what needs to be done to advance the
field.

1 Introduction


Paper-based documents contain information in various forms such as text, graphics, pictures,
mathematical formulas, and tables. To fully “understand” each type of information, it is
necessary to apply analysis techniques specific to that type of information. Initially,
however, a scanned image of the document must be divided or segmented into homogeneous
regions and each region should be classified so that appropriate analysis can be applied; for
example, so that graphics can be vectorized, pictures can be compressed, and text can be
segmented into lines, words, and/or characters and recognized. Segmentation and classification
are therefore of great importance for document image processing and its applications, because
they define a baseline for the whole process of information conversion to digital form.

A great deal of work has been done on page segmentation and zone classification and various
methods have been proposed (see the earlier surveys [36, 89]). The purpose of this paper is
to survey existing methods, to highlight special features, and to suggest tasks that will lead
to better results in future applications. This work does not describe the methods in detail,
but rather assumes that a brief and simple introduction to the subject is often more helpful
than reading many papers and trying to understand the authors’ thoughts. Here, we focus
on geometrical layout analysis without considering logical layout analysis. Topics such as the
division of pages into headers, footers, title, and abstracts, and how text regions are related to
each other, are not directly within the scope of this paper.

We will review page segmentation and zone classification methods for document images
consisting of text, binary graphics, and binarized, halftone or color pictures. Several examples
of such images are shown in Fig. 1.

This paper has the following structure. Section 2 describes the basic elements that typically
appear in document images and common image types. Section 3 briefly considers various classes
of document images, tasks specific for each class, and known difficulties in their layout analysis.
Section 4 presents the state of the art of page segmentation and zone classification methods.
Section 5 presents our view on the state of the art and overviews related document image
analysis tasks such as document compression, representation and benchmarking of document
layout analysis algorithms. Section 6 concludes the paper.

Figure 1: Examples of document images

2 Basic characterization of document images


Documents are usually scanned and represented as binary (1 bit per pixel), gray-scale (typically
8 bits per pixel), or color (typically 8-24 bits per pixel) images. The type of image used is
application-dependent, and transformations between types are often done in order to speed up
computations: color to gray-scale (when a color image is split into three gray-scale images
corresponding to the three color planes R, G, and B), color to binary (when applying edge
detection directly to a color image), and gray-scale to binary (when using binarization).
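
As a rough illustration of two of these transformations, the following Python sketch splits a
color image into its three planes and binarizes one of them. The function names, the nested-list
image representation, and the fixed global threshold of 128 are assumptions made for this
example only; the surveyed papers use a variety of binarization schemes.

```python
def split_planes(rgb):
    """Split an RGB image (rows of (r, g, b) tuples) into three gray-scale planes."""
    red = [[p[0] for p in row] for row in rgb]
    green = [[p[1] for p in row] for row in rgb]
    blue = [[p[2] for p in row] for row in rgb]
    return red, green, blue


def binarize(gray, threshold=128):
    """Map a gray-scale image to binary: 1 = dark (ink), 0 = light (background).

    The fixed global threshold is an assumption of this sketch.
    """
    return [[1 if value < threshold else 0 for value in row] for row in gray]


# A 2x2 color image with one dark pixel and three light ones.
image = [[(10, 20, 30), (250, 250, 250)],
         [(240, 230, 220), (255, 255, 255)]]
red, green, blue = split_planes(image)
binary = binarize(green)
```

Each plane is an ordinary gray-scale image, so any gray-scale method can be applied to it
separately.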

Basic  elements  or  entities  that  can  be  present  in  document  images  include  but  are  not 
limited  to: 

• Text (characters and digits in the text body, titles, headings, cells of tables, figure
captions, or nested in pictures and graphics as annotations)

• Tables (with and without ruling lines as field separators)

•  Mathematical  expressions 

•  Binarized,  halftone,  and  color  pictures 

• Graphics (flow charts, line drawings, plots, diagrams, logos, etc.).

For some regions that contain text, it is often unclear whether to classify them as text or as
non-text. For example, we may wish to treat the entire region as non-text, if the text is nested
in or semantically close to pictures or graphics. On the other hand, it can be useful to process
such regions as text when the text explains, for example, a line drawing. In most cases, even if
text is initially classified as non-text, it can be extracted as part of zone-specific processing.

Geometric  document  layout  and  skew  are  two  important  aspects  of  a  document  image.  A 
document’s  layout  is  the  way  in  which  document  elements  (text,  pictures,  graphics,  etc.)  are 
arranged  on  the  document  image.  There  are  three  basic  types  of  layouts:  Manhattan  layout, 
where  regions  are  constrained  polygons  whose  boundaries  are  straight  horizontal  and  vertical 
lines; rectangular layout, which is a specific case of Manhattan layout; and arbitrary layout,
where  boundaries  form  unconstrained  polygons  or  overlapping  regions.  Document  skew  is  the 
slant  of  the  document  image  with  respect  to  the  primary  orientation  of  the  page.  Skew  is 
most  often  introduced  by  improper  positioning  of  the  document  on  a  scanning  device.  In  many 
cases, this negatively influences the performance of page segmentation and zone classification
methods. 

Most  layout  analysis  methods  segment  and  classify  the  image  into  three  basic  classes:  text, 
graphics,  and  pictures.  Mathematical  expressions  may  be  initially  classified  as  text,  and  tables 
may  be  classified  as  graphics.  When  doing  this,  it  is  assumed  that  a  finer  classification  will 
be  done  at  the  document  image  understanding  stage,  where  the  text  is  divided  into  logical 
components. 

3  Document  classes  and  applications 

It is highly unlikely that a single generic method will be developed that can process all classes
of documents because of their variety and complexity, although a number of methods designed
to analyze the images of several different classes of documents have been proposed. It appears
most useful to begin with classification of documents by their geometrical layouts so that
when new applications arise, we can evaluate what layout analysis methods can be employed
based on properties of the target class. Although we do not consider the following list to be
comprehensive, nor do we consider each class to be exclusive, this taxonomy of document genres
adequately highlights the field:

1.  structured  articles  (in  journals,  newspapers,  and  newsletters), 

2.  documents  with  unconstrained  layout  (advertisements,  cover  and  title  pages  of  CDs, 
books,  and  journals), 

3. semi-structured layouts (envelopes, post and business cards, bank checks, forms, and
table-like documents),

4.  maps  and  engineering  drawings, 

5.  non-traditional  documents  (WWW  pages  and  video  frames). 

In  this  section,  we  will  give  a  brief  overview  of  the  main  segmentation  and  classification  tasks 
to  be  addressed  for  instances  of  each  class  mentioned  above  together  with  known  problems. 

3.1  Structured  articles 

Journals, newspapers and newsletters are a primary source of information and are widely
published and easily accessible. Their images can contain zones of many types (text of different
font styles/sizes, various types of graphics, binarized, halftone, and color pictures) and can be
scanned as binary, gray-scale or color. Zones are typically physically separated from each other;
however, the gaps between them can be very narrow. In some cases, text can be embedded
into graphics or pictures, and can be either darker (normal printing) or lighter (inverse
printing) than the background. Articles in journals, newspapers, and newsletters are examples of
structured documents, because all elements on a page are typically ordered and linked together
based on general rules. Examples of such rules include: the reading order of text blocks is from
top to bottom, left to right; figure captions are located after the corresponding figures; text
data form relatively large regions; figures or tables appear only after references to them in the
text. Newspapers have more complex and less structured logical organization than journals
or newsletters. They can be typically described with the Manhattan layout but can also have
arbitrary (unconstrained) layout. Cover pages of technical journals may have a table-like
structure, but they are typically weakly constrained for non-technical magazines. Although these
models can be language- or even publication-specific, they are typically quantifiable. The main
tasks to be addressed include skew estimation and correction (sometimes optional), document
segmentation into homogeneous regions, and classification of these regions as background, text,
graphics, and pictures (see, for example, [7, 20, 27, 30, 34, 53] for recent work; other papers are
cited in Section 4). Text lines with no skew are typically horizontal or vertical, but they are
slanted with respect to the X- or Y-axis if skew is present. Usually all regions on the image
have the same skew (global skew), but cases where some regions have different orientations than
others are also possible.

Despite the rather structured layout in many cases, the major problems are due to different
image types, application conditions that must be satisfied (skew-, layout-, script-independence),
and noise.

3.2  Unconstrained  layouts 

The term “unconstrained layout” reflects a lack of general rules when defining the documents
of this class. More generally, the layout of such documents depends on their designer’s goal.

3.2.1  Advertisements 

Advertisements are usually printed in magazines or newspapers, but they can also be placed on
the Web as electronic documents. Their layout is typically more arbitrary than that of journal
and newspaper articles. Multiple skew for text and non-text regions, curved text lines whose
characters are not aligned along a straight line, and text nested in or touching pictures often
occur in order to emphasize important information and to attract the reader’s attention [42, 48].
In short, advertisements are unstructured documents.

3.2.2  Cover  and  title  pages 

The images of these documents are often in color and they can contain arbitrarily oriented
text of a large variety of a priori unknown font sizes printed on a complex color or textured
background.

Layout analysis of images of this class requires text extraction and recognition [17, 37, 48,
68, 84, 108, 109], which is useful for information retrieval. Text queries are easier to formulate
than ones described by non-text features such as shape, texture, and color.

The challenges are due to arbitrary document layout and complex color background, which
make accurate text detection difficult.

3.3  Semi-structured  layouts 

Semi-structured layouts have structural elements in which a user should enter information or
which confine the entered data, but there are typically no limitations on the locations of these
elements. This means that two documents belonging to this class can have different layouts,
although they may both contain the same structural elements.

3.3.1  Postal  envelopes  and  bank  checks 

Postal correspondence and bank checks are examples of semi-structured documents since they
have predefined sets of fields such as the address block or the amount of money, but the
locations of these fields within the document are subject only to convention. Business cards
are also included in this class because they have a similar logical structure. The images of this
class of documents may be binary, gray-scale or color and text may be printed on a complex
textured background.

The main tasks deal with the identification of different fields which are specific to given
types of documents. For bank checks, they may include: signature, check number, date, courtesy
amount (amount of payment in a numeric format), legal amount (amount of payment in
a character format), account number, payee’s name, address of financial institution, and/or
logos [1, 18, 25, 40, 54, 57, 85, 105]. For business cards, they may include holder’s name,
affiliation, and address [19, 97]. For postal correspondence, the address block or blocks, stamp,
bar code, and postage paid indications should be identified [22, 69, 92, 96, 99, 101, 104]. In
systems such as that used in the British Postal Office, a stamp value is also read for revenue
protection [69].

The  major  problems  result  from  1)  a  complex  background,  making  it  difficult  to  binarize 
or  segment  the  image,  2)  a  mixture  of  printed  and  handwritten  text  touching  or  intersecting 
the ruling lines, 3) changes in illumination, 4) arbitrary document orientation on a moving
platform,  and  5)  restrictions  on  processing  time. 

3.3.2  Forms  and  table-like  documents 

Forms and table-like documents such as questionnaires or invoices consist of fields or cells in
which  handwritten  or  printed  data  should  be  entered.  The  initial  images  are  usually  binary, 
although  color  forms  exist  too.  In  the  latter  case,  however,  the  image  background  is  uniform 
so  that  it  does  not  seem  to  be  very  difficult  to  separate  text  from  it.  In  addition  to  text, 
invoices  can  also  contain  small  pictures  such  as  logos.  Forms  and  table-like  documents  are 
semi-structured  documents  with  limitations  imposed  by  different  separators  on  the  location  of 
text  data. 

The basic tasks to be solved here are field isolation and removal of response or bounding
boxes and/or ruling lines to extract the text inside each field [9, 14, 21, 38, 43, 61, 67, 77, 82,
90,  100,  107].  Invoices  often  also  include  the  company’s  address  and  bank  account  location  [13, 
15,  55],  making  processing  them  somewhat  similar  to  processing  of  bank  checks. 

There are two types of form analysis. The first is known form analysis, where a given form
belongs to one of a set of known classes and a blank template (model) with empty fields or cells
is available for it. In this case, the first task is to identify the form by using specific labels or
structure that can be extracted from the model. Once the form type is identified, the locations
of data fields in terms of their coordinates become known.

The second type is unknown form analysis, where no prior knowledge about the form type
is used and an algorithm extracts a form structure based on separators between cells or fields.
The second type is more flexible, but it may be more computationally expensive and sometimes
less accurate.

Problems arise when the text data are written outside predefined boxes or fields; this makes
it difficult to locate them properly. Image transformations (rotation, scaling and translation)
also complicate the process of matching a predefined model to the input image. For invoices
sent through fax machines from sellers to customers, the corresponding images often have poor
quality due to the lossy compression used in fax transmission. Stamps on text areas can also
negatively affect text data extraction for such documents.

3.4  Maps  and  drawings 
3.4.1  Maps 

Map  images  usually  require  gray-scale  or  color  to  capture  detail.  Color  maps  consist  of  several 
color  layers,  where  each  layer  corresponds  to  a  particular  type  of  information.  Text  data  (names 
of countries, cities, lakes, etc.) are embedded into or surrounded by graphics (roads, different
kinds of textures, etc.).

Text  data  on  maps  and  drawings  forms  small  and  sparse  groups  arbitrarily  mixed  with 
graphic  elements.  Though  there  are  some  rules  for  such  “mixtures”,  they  are  only  valid  for 
a  narrow  family  of  maps,  which  does  not  allow  one  to  generalize  them  and  to  assume  that 
their  organization  is  structured.  In  fact,  text  may  appear  everywhere  and  text  lines  may  have 
arbitrary  orientation. 

Although  there  is  a  wide  variety  of  types  of  maps  (cadastral,  hydrographical,  topographical, 
etc.),  analysis  of  map  images  usually  consists  of  three  tasks  [4,  28,  60,  65,  71,  76,  79,  87,  91, 
94]: 1) text/graphics separation, 2) label assignment to characters and special symbols, and
3)  graphics  vectorization.  For  color  images,  color  separation  is  often  done  during  scanning 
by  using  a  special  scanner  designed  for  this  purpose.  Characters  and  symbols  are  separated 
from  graphics  at  each  color  layer  by  their  sizes  and  they  are  fed  to  recognition  modules,  while 
graphics  are  thinned  and  vectorized. 

The  large  variety  of  maps  leads  to  a  large  variety  of  methods  and  systems,  each  of  which 
is  usually  capable  of  processing  only  one  map  type.  Among  the  problems  are  characters  and 
symbols  touching  or  intersecting  graphics,  slanted  or  curved  text  strings,  variable  gaps  between 
adjacent  characters  belonging  to  the  same  string,  different  font  sizes  (though  it  is  possible  to 
know  in  advance  the  fonts  used  in  map  creation),  and  some  graphical  symbols  that  may  be 
similar  to  characters. 


3.4.2  Engineering  drawings 

Engineering drawings have some challenges in common with maps in that the documents of both
classes contain small and sparse text data among larger graphic regions. Unlike maps, however,
the images of engineering drawings are typically binary or gray-scale, are more sparse, and have
tighter geometric constraints. Geometrical layout analysis of engineering drawings [10, 2, 23,
41, 64, 66, 98, 106] requires text/graphics separation, and classification of graphic components
as lines (dashed or hatched), arcs, dimensions, etc.

The diversity of types of engineering drawings usually means that a particular method can
be only applied to one or a few different types. Text touching graphics and often poor quality
of the original paper document are the main problems.

3.5  WWW  page  images  and  video  frames 

WWW pages and video frames containing text are a relatively new class of documents. They
appear initially as electronic documents, unlike the others described previously. These
documents are typically in color and their images often have low resolution in order to keep their
sizes small for fast loading and are anti-aliased for perceptual clarity. Their main elements are
text and pictures arbitrarily located within a document. Text in WWW pages can appear either in a
character format or embedded in a bitmap. For video frames, text is often part of a rather
complex image, appearing as either scene or graphic text [59]. Characters are often rendered
at a resolution 3-4 times lower than the 200-300 dpi used for many other document classes,
making detection and recognition difficult.

WWW pages and video frames can be considered unstructured documents because their
layout is not limited to predefined rules, and in the case of video, it is often impossible to set
such rules. The primary task here is to locate text embedded in the image for purposes of
information retrieval and video indexing [11, 33, 39, 48, 58, 63, 73, 81, 108, 109].

The main problems are 1) the colors of text and background may be close to each other due
to low image resolution, or the text may have low contrast with the background (for example,
the text embedded in a complex textured background), so that thresholding will not separate
text and background; 2) text components may fall on a curved line, or in the worst case, they
may not be aligned at all (wave-like lines); 3) text components may have non-uniform color,
and may be fragmented into several subparts, not because of noise but because of their design;
4) effects of motion in video sequences.


4  The  state  of  the  art  of  page  segmentation  and  zone  classification  methods 

This section presents the state of the art of page segmentation and zone classification methods.
Among  all  the  document  classes  mentioned  in  Section  3,  we  have  chosen  structured  articles  (in 
journals,  newspapers,  and  newsletters),  documents  with  unconstrained  layout  (advertisements, 
cover  and  title  pages  of  CDs,  books,  and  journals),  and  non-traditional  documents  such  as 
WWW  pages,  because  such  documents  can  be  described  as  pages  (either  paper  or  electronic) 
usually  consisting  of  different  zones.  Unlike  these,  line  drawings,  maps,  bank  checks  and  forms 
do  not  have  a  zone-like  structure  since  they  contain  a  mixture  of  text  and  non-text  elements, 
often  touching  or  intersecting. 

This section is divided into three subsections reviewing different segmentation and zone
classification methods. Here, segmentation means image partitioning into homogeneous zones
or areas each containing only one data type such as text or graphics. These zones, however, do
not have class labels after document segmentation so zone classification is required to assign
labels to them based on the values of selected features. Many methods do both segmentation
and zone classification simultaneously. Each method will be only briefly described with no
details on its implementation.

4.1  Document  segmentation 

A traditional taxonomy divides document segmentation approaches into bottom-up
(data-driven), top-down (model-driven), and hybrid (intermediate between the bottom-up and
top-down) methods. The top-down techniques are useful and fast when one knows particulars of a
document’s layout a priori. In this case, the processing begins with a whole page at the highest
level that is then divided into smaller regions such as blocks, lines, words, and characters. The
bottom-up algorithms start from pixels and group them into connected regions that are then
combined into larger structures. They are more robust with respect to variations in document
layout but often slow. The hybrid methods occupy an intermediate place between the previous
classes; that is, they often try to combine the high speed of the top-down methods and the
robustness of the bottom-up methods. It is also possible to classify segmentation methods into
texture- and non-texture-based, because there are techniques that treat the document regions
as textures of different classes. See [72] for a survey of texture-based methods.

4.1.1  Bottom-up  methods 

The  methods  described  in  [24,  26,  42,  44,  46,  47,  53,  95]  are  based  on  component  analysis 
of  binary  images  and  grouping  of  the  components  into  characters,  lines,  and  blocks  by  using 
closeness  between  adjacent  components  and  their  sizes.  Smearing,  nearest  neighbor  search,  and 
Voronoi  diagrams  are  the  main  grouping  methods.  The  processing  times  vary  widely,  depending 
on  the  method.  The  methods  in  this  group  are  often  tolerant  to  skew  (sometimes  even  multiple 
skew)  and  arbitrary  document  layouts.  The  skew  can  also  be  computed  as  a  by-product  of  text 
line  extraction. 
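
The general idea of component analysis followed by proximity-based grouping can be sketched
as follows. This is a minimal illustration and not the procedure of any cited paper: the
8-connectivity, the bounding-box representation, the left-to-right merging rule, and the
parameter `max_gap` are all assumptions made for the example.

```python
from collections import deque


def connected_components(binary):
    """Return bounding boxes (top, left, bottom, right) of 8-connected foreground regions."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny in range(cy - 1, cy + 2):
                    for nx in range(cx - 1, cx + 2):
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def group_into_lines(boxes, max_gap):
    """Merge boxes that overlap vertically and are separated by at most max_gap columns."""
    lines = []
    for box in sorted(boxes, key=lambda b: b[1]):  # process left to right
        for i, line in enumerate(lines):
            overlaps = box[0] <= line[2] and line[0] <= box[2]
            gap = box[1] - line[3] - 1
            if overlaps and gap <= max_gap:
                lines[i] = (min(line[0], box[0]), min(line[1], box[1]),
                            max(line[2], box[2]), max(line[3], box[3]))
                break
        else:
            lines.append(box)  # no existing line is close enough
    return lines


page = [
    [1, 1, 0, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
]
boxes = connected_components(page)
lines = group_into_lines(boxes, max_gap=1)
```

On this toy page the five components collapse into three groups; the isolated component at the
right edge stays separate because its gap exceeds `max_gap`. Skew-tolerant methods would
replace the axis-aligned overlap test with a distance or angle criterion.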

If fast processing or generality are requirements, the choice of data representation is crucial
for the methods in this group. In [95] all regions are represented in a hierarchical tree structure
for processing various document types, such as forms or journals, containing English and Kanji
characters. The block adjacency graph provides very fast processing in [46, 47].

Iterative connectivity analysis of pre-classified square blocks together with a number of
carefully  selected  small  masks  forms  the  larger  regions  in  [78].  Connectivity  at  the  block  level 
can  dramatically  reduce  the  processing  time  in  comparison  with  pixel  connectivity. 

The method in [75] extracts information about the co-occurrence of pixel values within a
5 × 5 window centered at each pixel by scanning the image row by row. The nearest neighbor
clustering technique is then used to merge the pixels into homogeneous regions. The advantage
of this method is that it can process both binary and gray-scale images. It is not tolerant to
document skew or arbitrary layout.

Approaches based on the background analysis of binary images are considered in [5, 7, 12,
70, 74]. The most advanced method [5, 7] employs so-called white tiles. It first creates a net of
rectangles, each representing a widest rectangular area of white (background) space, and then
traces through these rectangles to identify the region contours. The method uses a flexible
data representation that can represent both Manhattan and arbitrary layouts without splitting
a complex-shaped region. It does not require prior skew correction and can process locally
skewed text regions with different orientations. It is also fast, because pixel-based processing is
performed only once.

Morphological operations (opening, closing) are applied in [29, 30, 49] to group pre-classified
pixels into larger regions. Image smearing using the Run Length Smoothing (RLS) method is
employed in [31, 32, 34, 56, 80]. This operation is similar to a directional morphological dilation.
Some methods of this group [29, 30, 32] are tolerant to document skew and/or arbitrary layout,
while others [31, 34, 49, 56] seem not to be.
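
A minimal sketch of run-length smearing is given below, assuming a binary image stored as
nested lists with 1 for foreground. Background runs no longer than a threshold are filled when
they lie between two foreground pixels in the same row or column. The function names, the
threshold parameter `max_run`, and the choice to leave runs touching the image border
unfilled are assumptions of this example; the cited implementations differ in such details.

```python
def smooth_row(row, max_run):
    """Fill background runs of length <= max_run that lie between two foreground pixels."""
    out = list(row)
    last_fg = None  # index of the most recent foreground pixel seen so far
    for i, value in enumerate(row):
        if value == 1:
            if last_fg is not None and 0 < i - last_fg - 1 <= max_run:
                for j in range(last_fg + 1, i):
                    out[j] = 1
            last_fg = i
    return out


def rls_horizontal(binary, max_run):
    """Apply run-length smoothing independently to every row."""
    return [smooth_row(row, max_run) for row in binary]


def rls_vertical(binary, max_run):
    """Apply run-length smoothing to every column by transposing, smoothing, transposing back."""
    columns = [list(col) for col in zip(*binary)]
    smoothed = rls_horizontal(columns, max_run)
    return [list(row) for row in zip(*smoothed)]


row = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
smeared = smooth_row(row, max_run=2)
column_image = rls_vertical([[1], [0], [1]], max_run=1)
```

With a small horizontal threshold, characters in a word merge; larger thresholds merge words
into lines and lines into blocks, which is why the threshold has to be matched to the expected
character and line spacing.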

Text extraction from complex color document images is treated in [17, 48, 68, 73, 108-110].
Many of these methods assume that characters have a uniform color and are arranged in
horizontal lines, and the colors of text and background are well separated. The method in [17]
processes images of technical journal covers. It consists of color quantization for reducing the
number of colors followed by edge- and color-based segmentation. The method in [48] extracts
text by using multivalued image processing; it can be applied to advertisement images, WWW
images, CD and book cover images, and video frames. A multivalued image may be a binary
image (advertisement), gray-scale image, pseudo-color image (WWW image in GIF format),
or full-color image (video, book, or CD cover). It is decomposed into multiple foreground and
background-complementary foreground images so that connected component analysis can be
used to identify text components.

The method in [68] appears to have several significant advantages over the others: it does
not depend on text line skew, and it can correctly process merged characters and even adjacent
lines. The method first detects connected components in one or several binary images. The
number of images depends on the image type (binary, gray-scale or color). For example, for
binary images, the connected components are detected on two images that have a positive and
negative text contrast with respect to the background. Text lines are extracted by means of
a hierarchical divisive procedure employing a number of heuristics. Gray-scale images of book
covers were chosen for demonstration of this method’s performance, though it seems that it can
treat other document classes as well.

The method in [73] analyzes WWW images. It first reduces the number of colors by color
quantization followed by connected component detection for each remaining color. After that,
character candidates are selected by thinning the connected components and by computing
their width and height during this process. The authors suggest that characters are composed
of a set of strokes that have relatively fixed width and ratio between width and height. Strokes
meeting these criteria are assumed to belong to characters and are used as input for a potential
field approach that groups adjacent characters into lines. Like [68], this method is able to detect
text of arbitrary orientation and even curved text lines.
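
The first two steps shared by several of these color methods, reducing the number of colors
and then examining each remaining color separately, can be illustrated with the sketch below.
Uniform per-channel quantization is used here only because it is simple; it is an assumption of
this example and is not claimed to be the quantization scheme of [17] or [73]. The function
names and the `levels` parameter are likewise hypothetical.

```python
def quantize(pixel, levels=2):
    """Snap each channel of an (r, g, b) pixel to one of `levels` evenly spaced bins."""
    step = 256 // levels
    return tuple(min(channel // step, levels - 1) for channel in pixel)


def masks_by_color(rgb, levels=2):
    """Return {quantized color: binary mask} for every color left after quantization."""
    height, width = len(rgb), len(rgb[0])
    masks = {}
    for y, row in enumerate(rgb):
        for x, pixel in enumerate(row):
            color = quantize(pixel, levels)
            if color not in masks:
                masks[color] = [[0] * width for _ in range(height)]
            masks[color][y][x] = 1
    return masks


# Reddish and bluish pixels with small variations in shade.
image = [[(250, 10, 10), (240, 20, 5), (10, 10, 250)],
         [(5, 5, 240), (255, 0, 0), (0, 0, 255)]]
masks = masks_by_color(image)
```

Each resulting mask is a binary image, so connected component detection and the character
selection heuristics described above can then be run once per color.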

The paper [109] proposes two methods of text location in complex images of book and CD
covers and video frames. The first method processes color images and uses a color histogram
to find a set of dominant colors in the image. Then connected components are extracted for
each color, and several heuristics utilizing the size, alignment, and proximity of text components
are applied to obtain candidate characters. The second method (also described in [108])
works on gray-scale images and employs the difference in the spatial variance between text and
background. This feature is higher for text lines than for background, so that text lines can
be identified by two sharp changes in the values of the spatial variance (from low to high and
from high to low). The spatial variance is computed for each pixel over a local neighborhood of
1 × N pixels in the horizontal direction. Then horizontal edges are detected on the variance
image by a Canny edge detector, and they are further merged into longer lines. Pairs of adjacent
lines with opposite orientations are grouped together by using simple heuristics in order to
locate text bounding boxes. This method must be modified somewhat if it is applied to color
images: color, not gray-scale, edge detection has to be used. Both methods are quite robust to
variations in font, color, and text size. A hybrid method combining the two previous methods
is also presented. It works better in cases where neither method alone can locate text regions.

The method in [110] first quantizes the color space of an input image into color classes using
a Euclidean minimum spanning tree. Then the text-like connected components in each color
class are identified and grouped into horizontal lines using a set of heuristics.
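
The spatial-variance cue used by the second method of [109] can be sketched as follows. This
is a simplified illustration under stated assumptions: the variance is computed over a 1 × N
horizontal window as described above, but the Canny edge detection and edge pairing are
replaced here by a plain threshold on the row-averaged variance. The function names and the
threshold are hypothetical.

```python
def horizontal_variance(gray, n):
    """Variance of gray levels over a 1 x n horizontal window centred on each pixel."""
    half = n // 2
    result = []
    for row in gray:
        out_row = []
        for x in range(len(row)):
            window = row[max(0, x - half): x + half + 1]  # clipped at the borders
            mean = sum(window) / len(window)
            out_row.append(sum((v - mean) ** 2 for v in window) / len(window))
        result.append(out_row)
    return result


def row_variance_profile(gray, n):
    """Average the per-pixel variance along each image row."""
    return [sum(r) / len(r) for r in horizontal_variance(gray, n)]


def text_bands(profile, threshold):
    """Return (first_row, last_row) ranges where the profile stays at or above threshold."""
    bands, start = [], None
    for y, value in enumerate(profile):
        if value >= threshold and start is None:
            start = y                      # low-to-high transition
        elif value < threshold and start is not None:
            bands.append((start, y - 1))   # high-to-low transition
            start = None
    if start is not None:
        bands.append((start, len(profile) - 1))
    return bands
```

The two transitions found by text_bands play the role of the paired edges with opposite
orientations; an edge detector is more robust on real images than a fixed threshold.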

4.1.2  Top-down  methods 

A common feature of the top-down methods described below is global-to-local processing.
However, this may mean several quite different things:

•  processing starts from the whole image and then descends to smaller blocks [3, 35],

•  a global transformation is applied to the whole image, and the pixels of the transformed
image are then grouped into clusters [45],

•  processing goes from a coarse to a fine image representation [16, 20], where large regions
are first quickly extracted in a coarse representation and are then refined in a finer
representation.

Projection profile analysis, which counts the number of pixels along a given direction, is
one of the best-known top-down techniques. It is applied to binary images with no skew and
recursively divides an image into smaller regions based on valleys in the vertical/horizontal
profile histograms, which correspond to visual separators between rectangular blocks. Pixel
projection profiles are utilized in [3], while profiles of bounding boxes are employed in [35] to
speed up processing.
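
A minimal sketch of recursive splitting at profile valleys is given below. It is meant only
to illustrate the idea, not to reproduce [3] or [35]: the half-open box convention
(top, left, bottom, right), the min_gap parameter, and the rule of cutting at the widest
interior run of empty rows or columns are all assumptions.

```python
def profile(binary, top, left, bottom, right, axis):
    """Black-pixel counts per row (axis=0) or per column (axis=1) of a region."""
    if axis == 0:
        return [sum(binary[y][left:right]) for y in range(top, bottom)]
    return [sum(binary[y][x] for y in range(top, bottom)) for x in range(left, right)]


def widest_valley(counts, min_gap):
    """Return (start, end) of the widest interior run of zeros, or None if there is none."""
    best, start = None, None
    for i, value in enumerate(counts + [1]):  # the sentinel closes a trailing run
        if value == 0 and start is None:
            start = i
        elif value != 0 and start is not None:
            interior = start > 0 and i < len(counts)
            if interior and i - start >= min_gap:
                if best is None or i - start > best[1] - best[0]:
                    best = (start, i)
            start = None
    return best


def xy_cut(binary, top, left, bottom, right, min_gap=2):
    """Recursively split a region at profile valleys and return the leaf boxes."""
    for axis in (0, 1):
        gap = widest_valley(profile(binary, top, left, bottom, right, axis), min_gap)
        if gap is None:
            continue
        a, b = gap
        if axis == 0:  # horizontal white band: split into upper and lower parts
            return (xy_cut(binary, top, left, top + a, right, min_gap)
                    + xy_cut(binary, top + b, left, bottom, right, min_gap))
        return (xy_cut(binary, top, left, bottom, left + a, min_gap)
                + xy_cut(binary, top, left + b, bottom, right, min_gap))
    return [(top, left, bottom, right)]
```

Because the cuts are axis-parallel, the sketch shares the limitation noted above: it assumes
a deskewed page whose blocks are separated by straight white bands.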

A global image transformation using a filter bank of several orientation-selective 2-D Gabor
filters is applied to a gray-scale image in [45], followed by pixel clustering on the transformed
image with a squared-error algorithm to detect text and non-text (background, picture) regions.
This method is tolerant to skew but very slow.

Document segmentation of gray-scale images based on four Gaussian pyramids, each with
four levels, is introduced in [20]. The pyramids represent four feature maps, where the features
are the average, variance, threshold, and median. Processing proceeds from low (less detailed)
to high (more detailed) image resolution. This method is skew- and layout-independent.
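
The coarse-to-fine organization can be illustrated by a small multi-resolution pyramid. The
sketch below is an assumption-laden simplification: it builds only the "average" feature
map and uses a 2 × 2 box filter in place of a Gaussian kernel, so it should not be read as
the construction used in [20]. The function names are hypothetical.

```python
def downsample_mean(img):
    """Halve the resolution by averaging non-overlapping 2 x 2 blocks."""
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4
             for x in range(0, len(img[0]) - 1, 2)]
            for y in range(0, len(img) - 1, 2)]


def mean_pyramid(img, levels):
    """Return [finest, ..., coarsest]; level 0 is the input image itself."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample_mean(pyramid[-1]))
    return pyramid
```

Processing would start at the last (coarsest) level of the returned list and refine region
boundaries while moving toward level 0.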

Text area detection on a textured background is addressed in [16], where a texture-based
approach is used that consists of feature extraction with Laws’ masks, coarse classification of
8 × 8-pixel non-overlapping blocks as text, background, or “fuzzy” (boundary between text and
background), and fine text segmentation of “fuzzy” blocks at the pixel level. Both coarse and
fine segmentation rely on stationary HMMs using from 4 to 8 states. The advantage of HMMs
over many neural network training procedures is that each model is trained independently of
the others: when the data of a new class is added, a new HMM is created and trained
only on the samples of that class (the other HMMs do not have to be retrained). The method
is not sensitive to text skew, document layout, or script type.

4.1.3  Hybrid  and  other  methods 

In  [93]  a  bottom-up  RLS  is  applied  to  a  binary  image  to  detect  text  lines  and  non-text  data, 
followed  by  top-down  recursive  X-Y  cuts  (RXYC)  that  combine  the  separate  text  lines  into 
blocks.  This  method  is  simple  to  implement  and  quite  fast,  but  it  can  only  analyze  rectangular 
layouts  and  requires  prior  binarization  to  process  gray-scale  images. 
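
Assuming that RLS here denotes run-length smoothing in its usual sense, the bottom-up step can
be sketched as follows: short white runs between black pixels are filled so that the
characters of a line merge into one blob. Only the horizontal pass is shown, and the max_gap
parameter and function names are assumptions, not values taken from [93].

```python
def smooth_row(row, max_gap):
    """Fill runs of 0s no longer than max_gap that lie between two 1s."""
    out = list(row)
    last_one = None
    for x, value in enumerate(row):
        if value == 1:
            if last_one is not None and 0 < x - last_one - 1 <= max_gap:
                for k in range(last_one + 1, x):
                    out[k] = 1
            last_one = x
    return out


def rls_horizontal(binary, max_gap):
    """Apply horizontal run-length smoothing to every row of a binary image."""
    return [smooth_row(row, max_gap) for row in binary]
```

For example, with max_gap = 2 the row 1 0 0 1 0 0 0 0 1 0 becomes 1 1 1 1 0 0 0 0 1 0: the gap
of two pixels is bridged, while the gap of four pixels is preserved as a separator.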

Adaptive split-and-merge segmentation of gray-scale document images into homogeneous
regions represented as leaves of a quadtree is developed in [62]. Splitting and merging are
performed at the same time. If a region is inhomogeneous, it is split into four rectangular
subregions by thresholding based on projection profiles. If two adjacent regions (which may or
may not be at the same segmentation level) are homogeneous and their union is also
homogeneous, they are merged. The mean value and variance of pixel intensities in each region
are used for making decisions about merging and splitting. The document layout can be arbitrary,
but prior skew estimation and correction are necessary. This method can be applied to images
of magazines, bank checks with complex backgrounds, and table-like documents. A similar
technique is used in [86].
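
The split and merge decisions can be sketched as below. This is a simplified illustration
rather than the algorithm of [62]: regions are split at their midpoints instead of at
positions chosen from projection profiles, homogeneity is reduced to a variance threshold
(max_var), and splitting and merging are shown as separate functions. All names are
hypothetical, and boxes are half-open (top, left, bottom, right).

```python
def variance(values):
    """Population variance of a non-empty list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def pixels(img, box):
    """List the intensities inside a half-open box."""
    top, left, bottom, right = box
    return [img[y][x] for y in range(top, bottom) for x in range(left, right)]


def split(img, box, max_var):
    """Return the homogeneous quadtree leaves covering the given box."""
    top, left, bottom, right = box
    if variance(pixels(img, box)) <= max_var:
        return [box]
    mid_y = (top + bottom) // 2 if bottom - top >= 2 else bottom
    mid_x = (left + right) // 2 if right - left >= 2 else right
    leaves = []
    for t, b in ((top, mid_y), (mid_y, bottom)):
        for l, r in ((left, mid_x), (mid_x, right)):
            if t < b and l < r:  # skip empty quadrants of 1-pixel-wide regions
                leaves += split(img, (t, l, b, r), max_var)
    return leaves


def can_merge(img, box_a, box_b, max_var):
    """Two regions may be merged if their union is still homogeneous."""
    return variance(pixels(img, box_a) + pixels(img, box_b)) <= max_var
```

A full implementation would also check adjacency before merging and would interleave the
two operations, as the text above describes.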

The method in [84] extracts text from complex color images of book and journal covers
by using a hybrid analysis. The top-down technique recursively splits the image into
rectangular blocks, and splitting terminates when there are pixels of at least two different colors
inside a given block. Homogeneous blocks are considered to belong to the uniform background,
while non-homogeneous blocks containing at least two different colors correspond to text. The
bottom-up technique detects homogeneous regions of arbitrary shape by utilizing a region
growing method. The results of both techniques are combined in order to verify whether a given
region is text or not by assuming that text is horizontally aligned.

A method that does not belong to either group is presented in [88]. Gray-scale image
segmentation is based on the fractal signature, which is smaller for the background than for
image blocks. The method is not iterative, and all processing is done in one step. This approach
can be applied to documents that have complex layouts.

4.2  Document  zone  classification 

A large number of methods [3, 26, 42, 46, 48, 68, 95, 109, 110] use features of connected
components to separate text and non-text in binary images. These features include the sizes of
a connected component and of its nearest neighbors, alignment, proximity, and elongation.

Texture run-length statistics based on the occurrences of the black and white pixels within
each segmented region are employed in [80, 93]. Run-length statistics calculated for four
directions (horizontal, vertical, left diagonal, and right diagonal) and classification based on a
decision tree are proposed in [83]. Textural features of white tiles are also used in [6].
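
Run-length features of this kind are simple to compute. The sketch below extracts the mean
black run length in the horizontal and vertical directions of a binary region; the choice of
these two particular statistics and the function names are assumptions, and the diagonal
directions used in [83] are omitted for brevity.

```python
def run_lengths(seq, value=1):
    """Lengths of maximal runs of `value` in a sequence."""
    runs, n = [], 0
    for v in seq:
        if v == value:
            n += 1
        elif n:
            runs.append(n)
            n = 0
    if n:
        runs.append(n)
    return runs


def mean_or_zero(xs):
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(xs) / len(xs) if xs else 0.0


def run_length_features(region):
    """Mean black run length along rows and along columns of a binary region."""
    horizontal = [r for row in region for r in run_lengths(row)]
    vertical = [r for col in zip(*region) for r in run_lengths(col)]
    return {"mean_h": mean_or_zero(horizontal), "mean_v": mean_or_zero(vertical)}
```

Such a feature vector could then be passed to a decision tree or another classifier to label
the region as text or non-text.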

In [74] the classification utilizes cross-correlation between adjacent scanlines and the
dependence of its behavior on the interline distance. A combination of run-length features and
cross-correlation between pixels is used in [78].

In [31, 32] the vertical projection profile separates text from non-text (it is periodic for text),
while the black pixel distribution helps to discriminate graphics from pictures (it is sparse for
the former and dense for the latter).

Gray-level histograms computed at several levels of image resolution distinguish between
background, text, graphics, and pictures in [20].

Classification using soft computing techniques is described in [29, 30, 49, 56]. Usually
a neural network (a multilayer perceptron in many cases) is first trained on samples of text,
graphics, and pictures and is then used for classification. Block sizes and run-length features
are inputs to a neural network in [56]. Low-order moments of wavelet packet components, which
are computed over small windows, are the features used for neural classification in [29, 30]. To
make classification more reliable for each window, fuzzy integration of decisions obtained from
neighboring windows is carried out. Decisions are integrated at several scales of image resolution
and within each scale as well. In [49] a neural network learns a small set of masks which best
discriminate between text, background, line drawings, and pictures. Convolving these masks
with the input image produces texture features that are used for classification of each image
pixel by the neural network into one of three classes (text + line drawings, halftone pictures,
and background). The regions belonging to the first class are further binarized with a fixed
global threshold and separated by the size of the connected component. This method is robust
to different languages and can discriminate between the text of languages such as English and
Chinese. The methods in [29, 30, 49] are different from the others, because window/pixel
classification is performed before segmentation into regions.

4.3  Literature  comparison 

In this subsection, we present a comparison of the properties and performance of different page
segmentation and zone classification methods. The results are given in Tables 1 and 2.

In Table 1, ‘Ref.’ refers to a given method. ‘Document type’ refers to the classes of
document images processed by the method. Under ‘Image type’, B, G, and C correspond to
binary, gray-scale, and color images, respectively. ‘Background’ can be U (uniform, usually
white or black), T (textured), or C (color). ‘Layout’ can be R (rectangular), M (Manhattan), or A
(arbitrary). ‘Skew’ indicates whether a method is tolerant to some degree of skew (Yes) or not
(No). ‘Test set’ and ‘AC %’ refer to the test set size and accuracy of a given method. The
symbol ‘-’ means that the given feature was not mentioned in the original paper. The symbol
‘+’ indicates that the results are not described by a single number, but several criteria are used,
such as fragmentation and over-merging rates. It is worth pointing out that some methods can
be more or less easily modified; their properties will then change.

In Table 2, ‘Time’ corresponds to the processing time in seconds on the computer listed under
‘Platform’, ‘Image size’ refers to the image size, and ‘Res.’ stands for image resolution. Image
sizes and resolutions are given in pixels or paper page format and in dpi, respectively. The
notation is the same as in Table 1.

Table 1: The most important properties of document layout analysis methods.

Ref.   Document type                  Image type  Background  Layout  Skew  Test set  AC %
------------------------------------------------------------------------------------------------
[3]    Journals, newspapers           B           U           R       No    33        94.8-97.2
[7]    Journals, newspapers,          B           U           A       Yes   40        +
         newsletters, advertisements
[16]   Unknown                        G           T           A       Yes   -         -
[17]   Journal covers                 C           C           A       No    100       95.2-98
[20]   Journals                       G           U           A       Yes   100       -
[26]   Journals, business cards,      B           U           A       No    30        93-100
         technical reports
[30]   Journals                       G           U           A       Yes   -         -
[31]   Journals                       B           U           R       No    -         -
[32]   Journals                       B           U           R       Yes   30        92-97
[34]   Newspapers                     B           U           M       No    100       96
[35]   Journals                       B           U           R       No    150       -
[42]   Advertisements, letters,       B           U           A       Yes   150       -
         envelopes
[45]   Newspapers                     G           U           A       Yes   -         -
[46]   Journals                       B           U           M       No    150       91.1-99.4
[48]   Advertisements                 B, G, C     U, C        A       No    26        99.2
       WWW images                                                           54        97.6
       Book covers                                                          30        72.0
       Video                                                                6,952     94.7
[49]   Journals                       G           U           M       No    -         -
[53]   Journals, newspapers           B           U           A       Yes   114       +
[56]   Journals                       B           U           R       No    50        98.18-99.61
[62]   Journals, forms, bank checks   G           U, T        A       No    -         -
[68]   Book covers                    G           U           A       Yes   100       91.2
[70]   Journals                       B           U           A       Yes   -         -
[73]   WWW pages                      C           C           A       Yes   200       88.8-92
[74]   Journals                       B           U           R       Yes   -         -
[75]   Journals                       B, G        U           M       No    -         -
[78]   Journals                       B           U           R       No    -         99-100
[80]   Journals, newspapers           B           U           R       No    100       -
[83]   Journals                       B           U           R       No    979       97
[84]   Book and journal covers        C           C           A       No    16        -
[88]   Journals                       G           U           A       No    -         -
[93]   Newspapers                     B           U           R       No    -         78-100
[95]   Journals, forms                B           U           A       Yes   -         -
[109]  Book and CD covers, video      C           C           A       No    -         -
[110]  WWW images                     C           C           A       No    262       90

Table 2: Properties related to the processing time for document analysis methods.

Ref.   Document type                  Time (s)     Image size                   Res. (dpi)  Platform
------------------------------------------------------------------------------------------------------------
[3]    Journals, newspapers           80           A4                           400         -
[7]    Journals, newspapers,          0.55         810 × 1151                   100         HP 9000/735
         newsletters, advertisements
[7]    Journals, newspapers,          1.2          1215 × 1727                  150         HP 9000/735
         newsletters, advertisements
[7]    Journals, newspapers,          6.5          2431 × 3455                  300         HP 9000/735
         newsletters, advertisements
[17]   Journal covers                 95           2000 × 2679                  -           Pentium 100
[17]   Journal covers                 180          1719 × 2476                  -           Pentium 100
[26]   Journals, business cards,      4.8          -                            -           PC 486
         technical reports
[30]   Journals                       22           A4                           300         Sun Sparc 20
[34]   Newspapers                     ≈9           6592 × 9890                  -           Pentium 350 II
[35]   Journals                       ≈2           Letter-sized                 300         Sun Sparc 10
[42]   Advertisements, envelopes,     306.9-563.6  <A4                          300         Sun Sparc IPX
         letters
[45]   Newspapers                     ≈120         512 × 512                    75          Sun Sparc 2
[46]   Journals                       1.3          A4                           300         SG Indigo
[48]   Advertisements                 0.15         548 × 769                    150         Sun UltraSparc I
[48]   WWW images                     0.11         385 × 234                    -           Sun UltraSparc I
[48]   Book covers                    0.4          763 × 537                    50          Sun UltraSparc I
[48]   Video frames                   0.09         160 × 120                    -           Sun UltraSparc I
[49]   Journals                       60-85        264 × 332 to 780 × 1080      100         Sun Sparc 20
[53]   Journals, newspapers           2.93         1053 × 1149                  90          Pentium 200 Pro
[53]   Journals, newspapers           5.37-7.03    2592 × 3300 to 3114 × 3554   300         Pentium 200 Pro
[68]   Book covers                    0.01-2.86    512 × 512                    -           Pentium 200 Pro
[70]   Journals                       20           1278 × 1746                  -           Sun Sparc 2
[74]   Journals                       0.9-1.9      A4                           300         Sun Sparc
[80]   Journals, newspapers           2.37         A4                           300         Sun Sparc
[84]   Book and journal covers        21.31        1600 × 2400                  200         Sun Ultra Sparc 5/10
[93]   Newspapers                     2.6          -                            100         Sun 3/60
[93]   Newspapers                     9.5          -                            200         Sun 3/60
[109]  Book and CD covers, video      5.5-6        256 × 256                    -           Sun Sparc 20

5  Analysis  of  the  methods 

In this section we try to generalize our ideas about document layout analysis techniques.
Although some authors [42, 44, 48, 62, 68, 95, 109] state that their methods are applicable to
a variety of document classes and image types, it seems that there is no generic solution to
the document segmentation and classification problem, because the broader a task is, the less
manageable it is, the more parameters have to be adjusted, and the less predictable the results
are.

This can be seen especially for the methods developed for text extraction from complex
gray-scale or color images. Usually such methods rely much more on heuristics than those
used for other classes. A common feature of these methods is that a number of conditions
must be satisfied, such as 1) uniformity of character color, 2) well-separated colors of text and
background, and 3) the characters should form a straight horizontal line (the last condition
becomes unnecessary for the recently proposed methods [37, 58, 68, 73]).

Text extraction from color images does not use OCR to verify the extraction results.
Application of OCR to this task might sometimes reduce the number of heuristics used and improve
the accuracy.

The layout analysis methods applied to images of journals, newspapers, and other
text-dominated documents can be divided into two groups. The methods in the first group require
skew estimation and correction before layout analysis. These methods are typically applied
to images with a rectangular or Manhattan layout, because skew correction for such images
significantly facilitates layout analysis. However, errors in skew detection will degrade the
accuracy of a layout analysis method if it cannot operate on skewed regions. The methods in
the second group first segment and classify the image into regions and then estimate
and correct the skew of text regions. Usually such methods work with complex and arbitrary
layouts or with multiple skew. In the latter case, skew estimation applied to all text blocks
may not be useful. Sometimes these methods avoid extra processing, because skew estimation
is not done for non-text regions. However, layout analysis seems to be a more difficult task
than skew estimation, and it is not always easy to do it as accurately as skew estimation. This
means that errors in complex layout analysis may appear more frequently than those in skew
estimation, resulting in postprocessing after skew estimation. The final choice depends on the
application and the complexity of the problem.

The initial image resolution differs between applications. For example, it is low
(72 dpi) for WWW images [110], while it can be quite high (up to 300 dpi) for journal or
newspaper images. Processing at low resolution results in fast computation, but fine document
details are corrupted, and this may (though does not always) negatively influence other steps
such as text segmentation or OCR. The choice of a proper resolution is therefore not a simple
task, because it involves analysis of many of the document processing steps.

Data structures are of great importance for layout analysis methods. They can dramatically
reduce the processing time and determine how easily the data can be accessed in the operations
after layout analysis. Examples of flexible and efficient structures are the BAG [46, 47], white
tile [5-7], quadtree [62], and square block tessellation [16, 29, 30, 78]. A good data representation
not only provides easy access to the data, but often results in skew- and/or
layout-independence.

There have been only a few applications of soft computing techniques to document layout
analysis [29, 30, 49, 56]. The paper [56] describes comprehensive research using several popular
neural networks for document classification. However, the number of training samples can be
very large (up to 1,000,000 in [49]), and it is unclear how many samples are needed because of
large inter-class and intra-class variations.

The  papers  [16,  49,  62]  demonstrate  the  effectiveness  of  document  layout  analysis  methods 
when  solving  very  difficult  tasks  of  text  extraction  from  a  complex  textured  background  [16,  62] 
and  separation  of  different  languages  within  the  same  image  [49].  Researchers  in  document 
image  analysis  have  not  yet  paid  much  attention  to  these  tasks. 

Use of the background can greatly help segmentation, because it is a natural separator
between different regions. The image resolution can often be reduced to 75-100 dpi before
processing. It is also better to combine document segmentation with document classification;
this saves much processing time because the data need to be accessed only once.

Processing time and accuracy are important features of document layout analysis methods.
Processing time varies widely: from ~1 s to several minutes per image. We have found that
fast methods such as [7, 53], which are simultaneously skew- and layout-independent, take
approximately 5.5-7 s to process a binary image at 300 dpi. Faster methods [46, 47, 74] take 1-2 s
to do this, but in this case either prior skew correction is necessary or the document layout
cannot be arbitrary. Fast computation may be more important for information retrieval, such as
text extraction from WWW images or video, than for the analysis of journal and newspaper images,
which can be more or less interactive. In the latter case, the user often has more freedom to
edit the results without new parameter settings than in the former case, where processing with
new settings is necessary if the previous results are not satisfactory. Thus how fine the parameter
settings should be depends on the particular application.

It is quite difficult to compare the accuracies of different methods, because they are often
tested on different data sets with different initial conditions. In many cases, except for the
analysis of pure text images, the results are only visually evaluated and objective performance
evaluation is absent. Moreover, there is no unique definition of an accuracy measure. For
example, it can be the ratio of the number of regions/pages correctly segmented and classified
to the total number of regions/pages, or it can be expressed by the number of cases where
regions are erroneously split or merged. Splitting of a region containing data of the same class
into several subregions may often be more easily corrected at later steps than merging of regions
belonging to different classes, which is not easy to detect. The accuracy varies from 70 to almost
100 percent for various methods and various document classes. It is higher for binary images
of journals and newspapers and lower for color WWW and video images.

Ground-truthing and benchmarking of document layout analysis methods are not completely
solved problems. Many methods are claimed to be skew-invariant, but it is very difficult to
verify their performance on a large data set of skewed images, because ground truth is usually
created only for upright images without skew. Currently available benchmarking systems evaluate
performance based on OCR results [51] (here it is not clear whether a mistake is due to bad
segmentation or to incorrect character recognition), or they need separate ground truth for
each skewed image [8, 52, 102, 103]. In the latter case, an image has to be scanned at all
possible skew angles, and after that ground truth is generated for each of them. Automated
and accurate ground-truthing is very time-consuming (up to 5-10 min per image at the pixel
level as reported in [102, 103], and up to 5 min per image at the bounding box level as reported
in [52]). Therefore, if one needs to evaluate the performance of a method using only one image
skewed at all angles from 1 to 100 degrees in 1-degree steps, it is necessary to scan this image
100 times at different angles. In this case, ground-truthing could take many hours! That is
why many authors prefer the old method of evaluation: their own visual perception. However,
promising results on the evaluation of layout analysis algorithms and ground truth generation
have begun to appear [8, 50, 52]. The method in [52] can be especially useful for ground truth
generation because it allows one to do it automatically, but it seems that it can be primarily
applied to text documents without pictures or graphics.
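
The "many hours" estimate above follows from simple arithmetic on the figures quoted from
[102, 103]: 100 scans at 5-10 min each. The helper below only restates that calculation; the
function name is hypothetical.

```python
def ground_truthing_hours(num_images, minutes_per_image):
    """Total ground-truthing effort in hours for a set of scanned images."""
    return num_images * minutes_per_image / 60.0


# 100 skew angles, 5 to 10 minutes of pixel-level ground-truthing per image
low = ground_truthing_hours(100, 5)    # about 8.3 hours
high = ground_truthing_hours(100, 10)  # about 16.7 hours
```

That is roughly one to two working days of effort for a single source page, before any
comparison between methods has been made.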

In  practice,  it  would  be  better  to  use  specialized  algorithms  for  different  types  of  documents 
and  tasks  in  order  to  get  optimal  performance.  The  following  features  would  be  useful  in  any 
case: 

•  tolerance to skew (uniform and non-uniform),

•  layout independence,

•  text extraction both on white and on inverse backgrounds,

•  easy access to data,

•  high speed and high accuracy,

•  independence of font type/size and script.

6  Conclusions


This survey has overviewed the state of the art in document layout analysis and has described
the progress in this area, primarily in the 1990s. First, we presented a brief analysis of the tasks
needed for various document classes. Then we focused in more detail on three groups:
1) structured articles, 2) documents with unconstrained layout, and 3) non-traditional documents.
In addition to describing the methods used for these groups, we considered such features as
image and background type, image resolution, processing time, tolerance to skew, and different
layouts.

Despite intensive research in this area, there is still no general method of processing the
images of different document classes both accurately and automatically. Important features that
such a method should have are skew-, layout-, and script-independence, high speed, high accuracy,
and a flexible image representation enabling easy access to the data. Proper formalization of the
notions of “graphic” and “picture”, automatic ground-truth generation, benchmarking of layout
analysis methods for binary/gray-scale images, and more automated and less heuristic-based
color image processing ought to be among the goals of future research.